Maximum Entropy Good-Turing Estimator for Language Modeling
Abstract
In this paper, we propose a new formulation of the classical Good-Turing estimator for n-gram language models. The new approach is based on defining a dynamic model of language production. Instead of assuming a fixed probability distribution for the occurrence of an n-gram across the whole text, we propose a maximum entropy approximation of a time-varying distribution. This approximation led us to a new distribution, which in turn is used to calculate the expectations of the Good-Turing estimator. This defines a new estimator that we call the Maximum Entropy Good-Turing estimator. Contrary to the classical Good-Turing estimator, it requires neither approximation of expectations nor windowing or other smoothing techniques. It also contains the well-known discounting estimators as special cases. Performance is evaluated both in terms of perplexity and of word error rate in an N-best re-scoring task, and a comparison to other classical estimators is performed. In all cases our approach performs significantly better than the classical estimators.
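Since the abstract builds on the classical estimator, a minimal Python sketch of textbook Good-Turing count re-estimation may help fix ideas. The function name and the fallback for sparse high counts are illustrative assumptions; this is the baseline formula, not the paper's maximum-entropy variant, which the abstract does not fully specify.

```python
# A minimal sketch of the classical Good-Turing estimator that the paper
# reformulates: r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of
# distinct n-grams observed exactly r times.
from collections import Counter

def good_turing_adjusted_counts(counts):
    """Map each observed count r to its Good-Turing estimate r*."""
    freq_of_freq = Counter(counts.values())          # N_r
    adjusted = {}
    for ngram, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        # For large r, N_{r+1} is often 0; classical recipes smooth N_r
        # (e.g. Simple Good-Turing) -- here we just fall back to r.
        adjusted[ngram] = (r + 1) * n_r1 / n_r if n_r1 > 0 else r
    return adjusted

counts = Counter(["the cat", "the cat", "the dog", "a cat"])
print(good_turing_adjusted_counts(counts))
# {'the cat': 2, 'the dog': 1.0, 'a cat': 1.0}
```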
Similar Papers
Distribution-Dependent Performance of the Good-Turing Estimator for the Missing Mass
The Good-Turing estimator for the missing mass has certain bias and concentration properties which define its performance. In this paper we give distribution-dependent conditions under which this performance can or cannot be matched by a trivial estimator, that is, one which does not depend on the observations. We introduce the notion of an accrual function for a distribution, and derive our conditions ...
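For orientation, the standard Good-Turing missing-mass estimator analyzed in this line of work is the fraction of singletons in the sample, M̂₀ = N₁/n. A tiny self-contained sketch (the function name is ours):

```python
# Good-Turing missing-mass estimate: the total probability of unseen
# symbols is estimated by the proportion of symbols seen exactly once.
from collections import Counter

def missing_mass_estimate(sample):
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)   # singletons
    return n1 / len(sample)

print(missing_mass_estimate("abracadabra"))  # 2 singletons / 11 symbols
```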
Empirical Evaluation and Combination of Advanced Language Modeling Techniques
We present results obtained with several advanced language modeling techniques, including a class-based model, a cache model, a maximum entropy model, a structured language model, a random forest language model, and several types of neural network based language models. We show results obtained after combining all these models using linear interpolation. We conclude that for both small and moderately s...
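The linear interpolation used in this abstract reduces to a convex combination of the component models' probabilities. A minimal sketch, assuming per-word probabilities from each model are already available (the weights here are made up; in practice they are tuned on held-out data, e.g. via EM):

```python
# Linear interpolation of language models:
# P(w | h) = sum_i lambda_i * P_i(w | h), with lambda_i >= 0 summing to 1.
def interpolate(model_probs, weights):
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(weights, model_probs))

# e.g. combining an n-gram model, a cache model, and a neural LM
print(interpolate([0.012, 0.020, 0.015], [0.5, 0.2, 0.3]))  # 0.0145
```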
Learning Theory and Language Modeling
Here we consider some of our recent work on Good-Turing estimators in the larger context of learning theory and language modeling. The Good-Turing estimators have played a significant role in natural language modeling for the past twenty years. We have recently shown that these particular leave-one-out estimators converge rapidly. Here we present these results and consider possible consequences f...
On Inference about Rare Events
Despite the increasing volume of data in modern statistical applications, critical patterns and events often have little, if any, representation. This is not unreasonable, given that such variables are critical precisely because they are rare. This raises a natural question: when can we infer something meaningful in such contexts? The focal point of this thesis is the archetypal pro...
Coverage-adjusted entropy estimation.
Data on 'neural coding' have frequently been analyzed using information-theoretic measures. These formulations involve the fundamental and generally difficult statistical problem of estimating entropy. We briefly review several methods that have been advanced to estimate entropy, and highlight one, the coverage-adjusted entropy estimator (CAE) due to Chao and Shen, that appeared recently in...
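The Chao-Shen CAE mentioned here shrinks the plug-in probabilities by the estimated Good-Turing sample coverage and then applies a Horvitz-Thompson correction for unseen symbols. A short illustrative sketch of that recipe (the degenerate all-singletons case is not handled):

```python
# Coverage-adjusted entropy estimator (CAE), Chao-Shen style:
# shrink plug-in probabilities by coverage C = 1 - N1/n, then weight each
# term by its inclusion probability 1 - (1 - p_i)^n.
import math
from collections import Counter

def coverage_adjusted_entropy(sample):
    n = len(sample)
    counts = Counter(sample)
    n1 = sum(1 for c in counts.values() if c == 1)   # singletons
    coverage = 1.0 - n1 / n                          # Good-Turing coverage
    h = 0.0
    for c in counts.values():
        p = coverage * c / n                         # coverage-adjusted p_i
        h -= p * math.log(p) / (1.0 - (1.0 - p) ** n)
    return h  # entropy in nats

print(coverage_adjusted_entropy("abracadabra"))
```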